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I. INTRODUCTION 



Principal component analysis (PCA) |MKB79J| is a key 
method in the statistical engineering toolbox. It is well 
over a century old, and is used in many different ways. 
PCA is also known as the Karhiinen-Loeve transform 
or Hotelling transform in image analysis, and a varia- 
tion is latent semantic analysis (LSA) in text analysis 
pI)DL+9Qj . It is a kind of eigen-analysis since it ma- 
nipulates the eigen-spectrum of the data matrix. It is 
usually applied to measurements and real valued data, 
and used for feature extraction or data summarization. 
LSA might not perform the centering step (subtracting 
the mean from each data vector prior to eigen-analysis) 
on the word counts for a document to preserve matrix 
sparseness, or might con vert the word counts to real- 
valued tf*idf |B YRNQflJ I . The general approach here 
is data reduction. 

Independent component analysis (ICA, see (HKOOlj ) 
is in some ways an extension of this general approach, 
however it also involves the estimation of so-called latent, 
unobservable variables. This kind of estimation follows 
the major statistical methodology that deals with gen- 
eral unsupervised methods such as clustering and factor 
analysis. The general approach is called latent .structure 
analysis, which is more recent, perhaps half a century old. 
The data is modelled in a way that admits unobservable 
variables, that influence the observable variables. Statis- 
tical inference is used to "reconstruct" the unobservable 
variables from the data jointly with general characteris- 
tics of the unobservable variables themselves. This is a 
theory with particular assumptions (i.e., a "model"), so 
the method may arrive at poor results. 

Relatively recently the statistical computing and ma- 
chine learning community has become aware of seemingly 
similar approaches for discrete observed data that ap- 
pears under many names. The best known of these in 



this community are probabilistic latent semantic indexing 
(PLSI) ;Hof99iJ , non- negative matrix factorisation (NMF) 
|LS99l| and latent Dirichlet allocation (LDA) BNJO||. 
Other variations are discussed later in Section[3 We re- 
fer to these methods jointly as Discrete Component Anal- 
ysis (DC A), and this article provides a unifying model for 
them. 

All the above approaches assume that the data is 
formed from individual observations (documents, indi- 
viduals, images), where each observation is described 
through a number of variables (words, genes, pixels). All 
these approaches attempt to summarize or explain the 
similarities between observations and the correlations be- 
tween variables by inferring latent variables for each ob- 
servation, and associating latent variables with observed 
variables. 

These methods are applied in the social sciences, de- 
mographics and medical informatics, genotype inference, 
text and image analysis, and information retrieval. By 
far the largest body of applied work in this area (us- 
ing citation indexes) is in genotype inference due to 
the Structure program [PSDOO] . A growing body of 
work is in tex t classification and topic modelling (see 
|CS04LIBPT 04'I'). and language modelling in information 
retrieval (see AGvR03, BJ04, Can04j). As a guide, ar- 
gued in the next section, the methods apply when PCA 
or ICA might be used, but the data is discrete. 

Here we present in Section IIIII a unified theory for 
analysis of components in discrete data, and compare 
the methods with related techniques in Section The 
main families of algorithms discussed in Section IVIII are 
a variational approximation, Gibbs sampling, and Rao- 
Blackwellised Gibbs sampling. Applications are pre- 
sented in Section rVlIII for voting records from the United 
States Senate for 2003, and the use of components in 
subsequent classification. 



II. VIEWS OF DCA 
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One interpretation of the DCA methods is that they 
are a way of approximating large sparse discrete matri- 



ces. Suppose we have a 500, 000 documents made up of 
1,500,000 different words. A document such as a page 
out of Dr. Seuss's The Cat in The Hat, is first given as a 
sequence of words. 

So, as fast as I could, 1 went after my net. 
And I said, "With my net I can bet them I 
bet, I bet, with my net, 1 can get those Things 
yet!" 

It can be put in the bag of words representation, where 
word order is lost. This yields a hst of words and their 
counts in brackets: 

after(l) and(l) as(2) bet(3) can(2) could(l) 
fast(l) gct(l) 1(7) my(3) net(3) said(l) so(l) 
thcm(l) things(l) those(l) went(l) with(2) 
yct(l) . 

Ahhough the word 'you' never appears in the original, we 
do not include 'you (0)' in the representation since zeros 
are suppressed. This sparse vector can be represented as 
a vector in full word space with 1, 499, 981 zeroes and the 
counts above making the non-zero entries in the appro- 
priate places. Given a matrix made up of rows of such 
vectors of non-negative integers dominated by zeros, it is 
called here a large sparse discrete matrix. 

Bag of words is a basic representation in information 
retrieval |BYRN99| . The alternative is a sequence of 
words. In DCA, either representation can be used and 
the models act the same, up to any word order effects in- 
troduced by incremental algorithms. This detail is made 
precise in subsequent sections. 

In this section, we argue from various perspectives that 
large sparse discrete data is not well suited to standard 
PGA or IGA methods. 



A. Issues with PCA 

PGA has been normally applied to numerical data, 
where individual instances are vectors of real numbers. 
However, many practical datasets are based on vectors 
of integers, non-negative counts or binary values. For ex- 
ample, a particular word cannot have a negative number 
of appearances in a document. The vote of a senator 
can only take three values: Yea, Nay or Not Voting. We 
can transform all these variables into real numbers using 
tf *idf , but this is a linear weighting that does not affect 
the shape of a distribution. 

With respect to modelling count data in linguis- 
tic appl ications. Dunning makes the following warning 
|Dun94^ : 

Statistics based on the assumption of normal 
distribution are invalid in most cases of sta- 
tistical text analysis unless either enormous 
corpora are used, or the analysis is restricted 
to only the very most common words (that 
is, the ones least likely to be of interest). 



This fact is typically ignored in much of the 
work in this field. Using such invalid methods 
may seriously overestimate the significance 
of relatively rare events. Parametric statis- 
tical analysis based on the binomial or multi- 
nomial distribution extends the applicability 
of statistical methods to much smaller texts 
than models using normal distributions and 
shows good promise in early applications of 
the method. 

While PGA is not always considered a method based on 
Gaus sians, it can b e justified using Gaussian distribu- 
tions | Row9 8. TB99] . Moreover, PGA is justified using a 
least squares distance measure, and most of the proper- 
ties of Gaussians follow from the distance measure alone. 
Rare events correspond to points far away under an L2 
norm. 

Fundamentally, there are two different kinds of large 
sample approximating distributions that dominate dis- 
crete statistics: the Poisson and the Gaussian. For in- 
stance, a large sample binomial is approximated as a 
Poisson [y] when the probability is small and as a Gaus- 
sian otherwise tRos89]. Figure ^ illustrates this by show- 
ing the Gaussian and Poisson approximations to a bino- 
mial with sample size iV = 100 for different proportions 
{p = 0.03,0.01,0.03). Plots are done with probability 
in log scale so the errors for low probability values are 
highlighted. One can clearly see the problem here: the 
Gaussian provides a reasonable approximate for medium 
values of the proportion p but for small values it severely 
underestimates low probabilities. When these low prob- 
ability events occur, as they always will, the model be- 
comes distorted. 

Thus in image analysis based on analogue to digital 
converters, where data is counts, Gaussian errors can 
sometimes be assumed, but the Poisson should be used if 
counts are small. DGA then avoids Gaussian modelling 
of the data, using a Poisson or multinomial directly. 

Another critique of the general style of PGA comes 
from the psychology lite rature , this time it is used as 
a justification for DGA |GS02| |. Griffiths and Steyvers 
argue against the least squares distance of PGA: 

While the methods behind LSA were novel in 
scale and subject, the suggestion that similar- 
ity relates to distance in psychological space 
has a long history (Shepard, 1957). Grit- 
ics have argued that human similarity judg- 
ments do not satisfy the properties of Eu- 
clidean distances, such as symmetry or the 
triangle inequality. Tversky and Hutchinson 
(1986) pointed out that Euclidean geometry 
places strong constraints on the number of 
points to which a particular point can be the 
nearest neighbor, and that many sets of stim- 
uli violate these constraints. 

They also considered power law arguments which PGA 
violates for associated words. 
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FIG. 1: Gaussian and Poisson approximations to a binomial. 



B. Component Analysis as Approximation 



C. Independent Components 



In the data reduction approach for PCA, one seeks to 
reduce each J-dimensional data vector to a smaller K- 
dimensional vector. This can be done by approximating 
the full data matrix as a product of smaller matrices, 
one representing the reduced vectors called the compo- 
nent/factor score matrix, and one representing a data in- 
dependent part called the component/factor loading ma- 
trix, as shown in Figure |21 In PCA according to least 
squares theory, this approximation is made by eliminat- 
ing the lowe r-order e igenvectors, the least contributing 
components |MKB79f . 

If there are I documents, J words and K components, 
then the matrix on the left has I * J entries and the 
two matrices on the right have {I + J) * K entries. This 
represents a simplification when K <^ I, J. We can view 
DCA methods as seeking the same goal in the case where 
the matrices are sparse and discrete. 

When applying PCA to large sparse discrete matrices, 
or LSA using word count data interpretation of the com- 
ponents, if it is desired, be comes diffi cult (it was not a 
goal of the original method |DDL"'"90l| '). Negative values 
appear in the component matrices, so they cannot be 
interpreted as "typical documents" in any usual sense. 
This applies to many other kinds of sparse discrete data: 
low intensity images (such as astronomical images) and 
verb-noun data used in language models introduced by 
|PTL93| . for instance. 

The cost function being minimized then plays an im- 
portant role. DCA places constraints on the approxi- 
mating score matrix and loading matrix in Figure |21 so 
that they are also non-negative. It also uses an entropy 
distance instead of a least squares distance. 



Independent component analysis (ICA) was als o devel- 
oped as an alternative to PCA. Hyvanen and Oja |HO0Cl| | 
argue that PCA methods merely find uncorrelated com- 
ponents. ICA then was developed as a way of repre- 
senting multivariate data with truly independent compo- 
nents. In theory, PCA approximates this also if the data 
is Gaussian [TB99|, but in practice it rarely is. 

The basic formulation is that a ii'-dimensional data 
vector u; is a linear invertible function of K independent 
components represented as a isT-dimensional latent vec- 
tor I, w = &l for a square invertible matrix 0. Note 
the ICA assumes J — K in our notation. plays the 
same role as the loading matrix above. For some univari- 
ate density model U, the independent components are 
distributed as p{l \ U) = W^pih \ U), thus one can get 
a likelihood formula p{w \ 0, U) using the above equality 

The Fast ICA algorithm |HO00l | can be interpreted 
as a maximum likelihood approach based on this model 
and likelihood formula. In the sparse discrete case, how- 
ever, this formulation breaks down for the simple rea- 
son that w is mostly zeros: the equation can only hold 
if I and are discrete as well and thus the gradient- 
based algorithms for ICA cannot be justified. To get 
around t his in practice, when applying ICA to documents 
|BKG03| . wor d counts a re sometimes first turned into 
tf *idf scores |BYRN99l |. 

To arrive at a formulation more suited to discrete data, 
we can relax the equality in ICA (i.e., w = &l) to be an 
expectation: 



E„ 



-^pi-w\l.,U) 



[w] = &l 



We still have independent components, but a more ro- 
bust relationship between the data and the score vector. 
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FIG. 2: The matrix approximation view 



Co rrespondence between ICA and DCA has been noted 
in |B J04l ICan04J . With this expectation relationship, the 
dimension of I can now be less than the dimension of w, 
K < J, and thus would be a rectangular matrix. 



III. THE BASIC MODEL 

A good introd uction to these models from a number of 
viewpoints is by |BN.Tn3l Ic^aiinl lB,Tn4J . Here we present 
a general model. The notation of words, bags and docu- 
ments will be used throughout, even though other kinds 
of data representations also apply. In statistical termi- 
nology, a word is an observed variable, and a document 
is a data vector (a list of observed variables) represent- 
ing an instance. In machine learning terminology, a word 
is a feature, a bag is a data vector, and a document is 
an instance. Notice that the bag collects the words in 
the document and loses their ordering. The bag is repre- 
sented as a data vector w. It is now J-dimensional. The 
latent, hidden or unobserved vector I called the compo- 
nent scores is ii'-dimensional. The term component is 
used here instead of topic, factor or cluster. The pa- 
rameter matrix is the previously mentioned component 
loading matrix 0, and is J x K. 

At this point, it is also convenient to introduce the 
symbology used throughout the paper. The symbols 
summarised in Table Q] will be introduced as we go. 



A. Bags or Sequences of Words? 

For a document x represented as a sequence of words, 
if ii; = bag(a;) is its bagged form, the bag of words, repre- 
sented as a vector of counts. In the simplest case, one can 
use a multinomial with sample size L = \x\ and vocab- 
ulary size J — \w\ to model the bag, or alternatively L 
independent discrete distributions |^ with J outcomes to 
model each xi. The bag w corresponds to the sequence 

X with the order lost, thus there are -h-r — n^ different 

sequences that map to the same bag w. The likelihoods 
for these two simple models thus differ by just this com- 
binatoric term. 



Note that some likelihood based methods such as maxi- 
mum likelihood, some Bayesian methods, and some other 
fitting methods (for instance, a cross validation tech- 
nique) use the likelihood as a black-box function. They 
take values or derivatives but otherwise do not further in- 
teract with the likelihood. The combinatoric term map- 
ping bag to sequence representations can be ignored here 
safely because it does not affect the fitting of the pa- 
rameters for M.. Thus for these methods, it is irrelevant 
whether the data is treated as a bag or as a sequence. 
This is a general property of multinomial data. 

Thus, while we consider bag of words in this article, 
most of the theory applies equally to the sequence of 
words representation |j]. Implementation can easily ad- 
dress both cases with little change to the algorithms, just 
to the data handling routines. 



B. General DCA 

The general formulation introduced in Section FlI CI is 
an unsupervised version of a linear model, and it applies 
to the bag of words w as 



E, 



■W'-^p{'w\l.&) 



\w] = @l 



(1) 



The expected value (or mean) of the data is given by 
the dot product of the component loading matrix and 
some latent component scores I. ^^_^^^ 

In full probability (or Bayesian) modelling |GCSR95l| . 
we are required to give a distribution for all the non- 
deterministic values in a model, including model param- 
eters and the latent variables. In likelihood modelling 
CB90J, we are required to give a distribution for all the 
data values in a model, including observed and latent 
variables. These are the core methodologies in computa- 
tional statistics, and most others extend these two. The 
distribution for the data is called a likelihood in both 
methodologies. 

The likelihood of a document is the primary way of 
evaluating a probabilistic model. Although likelihood is 
not strictly a probability in classical statistics, we can in- 
terpret them as a probability that a probabilistic model 
Ai would generate a document x, P{x\J^). On the other 
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number of documents 

subscript to indicate document, sometimes dropped 

number of different words, size of the dictionary 

number of components 

number of words in document i 

number of words in the collection, ^^ L(i) 

vector of J word counts in document i, row totals of V , entries w-,,(i) 

vector of K component counts for document i, column totals of V 

matrix of word counts per component, dimension J x K, entries Vj^k 

vector of K component scores for document i, entries lk,(i) 

l(^i) normalised, entries mi^^^i-) 

vector of L(i) sequential component assignments for the words in 

document i, entries fc;,(i) € [1, . . . , A"] 

component loading matrix, dimension J x K, entries 9j^k 

component loading vector for component k, a column of 

ii'- dimensional parameter vectors for component priors 



TABLE I: Summary of major symbols 



hand, it is also a way of determining whether the docu- 
ment is usual or unusual: documents with low likelihood 
are often considered to be outliers or anomalies. If we 
trust our documents, low likelihoods indicate problems 
with the model. If we trust out model, a low likelihood 
indicates problems with a document. 

Thus to complete the above formulation for DCA, we 
need to give distributions matching the constraint in 
Equation Q, to specify the likehhood. Distributions are 
needed for: 

• how the sequence x or bag w is distributed given 
its mean &l formed from the component loading 
matrix, 

• how the component scores I are distributed, 

• and if full probability modelling is used, how the 
component loading matrix is distributed apriori, 
as well as any parameters. 

The formulation of Equation (Q is als o called an ad- 
mixture model in the statistical literature 'PSD00l| . This 
is in contrast with a mixture model |GCSR,95] which uses 
a related constraint 



E„ 



-^p(iu|Z) 



e. 



for some latent variable k representing the single latent 
component for w. Since k is unobserved, this also cor- 
responds to making a weighted sum of the probability 
distributions for each 6. k- 



IV. THE MODEL FAMILIES 

This section introduces some forms of DCA using spe- 
cific distributions for the sequence x or bag w and the 
component scores I. The fundamental model here is the 
Gamma-Poisson Model (GP model for short). Other 



models can be presented as variations. The probabil- 
ity for a document is given for each model, both for the 
case where the latent variables are known (and thus are 
on the right-hand side), and for the case where the latent 
variables are included in the left-hand side. 



A. The Gamma-Poisson Model 

The g eneral G amma-Poisson form of DCA, introduced 
as GaP |Can04J is now considered in more detail: 



• Document data is supplied in the form of word 
counts. The word count for each word type is Wj. 

Uj. 



Let L be the total count, so i = ^. 



Wi 



• The document also has component scores I that in- 
dicate the amount of the component in the docu- 
ment. These are latent or unobserved. The entries 
Ik are independent and gamma distributed 



Ik ^ Gamma(afc,/3A:) 



for fc = 1, 



,K. 



The /3k affects scaling of the components 5] , while 
ak changes the shape of the distribution, shown in 
Figure 131 

• There is a component loading matrix & of size JxK 
with entries Oj^k that controls the partition of fea- 
tures amongst each component. In the matrix, each 
column for component k is normalised across the 
features, meaning that ^ 9j,k — 1- Thus each col- 
umn represents the proportions of words/features 
in component k. 

• The observed data w is now Poisson distributed, 
for each j 
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FIG. 3: Gamma distribution for different values of q^. 



• The K parameters {at, Pk) to the gamma distribu- 
tions give iiT-dimensional parameter vectors a and 
/3. Initially these vectors will be treated as con- 
stants, and their estimation is the subject of later 
work. 

• When using Bayesian of full probability modelling, 
a prior is needed for 0. A Dirichlet prior can be 
used for each k-th component of with J prior 
parameters 7j, so 9.k ~ Dirichletj('y). In prac- 
tice we use a Jeffreys' prior, which has 7j = 0.5. 
The use of a Dirichlet has no strong justification 
other than being conjugate GCSR95], but the Jef- 



J 



freys' prior has some minimax properties [CB94| 
that make it more robust. 



The hidden or latent variables here are the compo- 
nent scores I. The model parameters are the gamma 
parameters (3 and a, and the component loading 
matrix 0. Denote the model as GP standing for 
Gamma-Poisson. The full likelihood for each document, 
p{w, l\ (3,a, 0, K, GP), is composed of two parts. The 
first part comes from K independent gamma distribu- 
tions for Ik, and the second part comes from J indepen- 
dent Poisson distributions with parameters J^k ^kdj,k- 



likelihood of I 



likelihood of w given I 



n 



p^^lt'-'eM-Pklk} TT (Efe hOj.kr exp {- (Ek lk0,,k)} 



r(afc) 



n 



(2) 



B. The Conditional Gamma-Poisson Model 



otherwise gamma distributed with probability 1 — pk- 



In practice, when fitting the parameters a in the GP 
or DM model, it is often the case that the ak go very 
small. Thus, in this situation, perhaps 90% of the com- 
ponent scores Ik are negligible, say less than 10~^ once 
normalised. Rather than maintaining these negligible 
values, we can allow component scores to be zero with 
some finite probability. The Conditional Gamma-Poisson 
Model, denoted CGP for short, introduces this capabil- 
ity. In retrospect, CGP is a sparse GP with an additional 
parameter per component to encourage sparsity 

The CGP model extends the Gamma-Poisson model 
by making the Ik zero sometimes. In the general case, 
the Ik are independent and zero with probability pk and 



Gamma(afe, f3k 



for fc = 1, . 



,K. 



Denote the model as CGP standing for Condi- 
tional Gamma-Poisson, and the full likelihood is now 
p{w, l\l3,a, p, 0, JsT, CGP). The full hkelihood for each 
document, modifying the above Equation (j^J, replaces 
the term inside Yik with 



(1 - Pk) 



/3r^r 



^exp{-Pklk} 



r(afe) 



+ Pk^h^o (3) 



C. The Dirichlet-Multinomial Model 

The Dirichlet-multinomial form of DCA was intro- 
duced as MPCA. In this case, the normalised latent vari- 
ables m are used, and the total word count L is not 
modelled. 



TABLE II: Previously Published Models 



m ^ Dirichletx(cK) , 



w ^ Multinomial(L, 0m) 



The first argument to the multinomial is the total count, 
the second argument is the vector of probabilities. De- 
note the model as DM, and the full likelihood is now 
p(i(;,m I L, a, 0, X, DM). The full likelihood for each 
document becomes: 



'' ~K?"')nC, 



HE 



mkVj,k 



(4) 

where C^ is L choose wi , . . . , wj . This model can also 
be derived from the Gamma-Poisson model, shown in the 
next section. 



D. A Multivariate Version 

Another variation of the methods is to allow group- 
ing of the count data. Words can be grouped into sepa- 
rate variable sets. These groups might be "title words," 
"body words," and "topics" in web page analysis or 
"nouns," "verbs" and "adjectives" in text analysis. The 
groups can be treated with separate discrete distribu- 
tions, as below. The J possible word types in a document 
are partitioned into G groups i3i, . . . , Bq- The total word 
counts for each group g is denoted Lg = J2jeB 



Wj . If the 



vector w is split up into G vectors Wg — {wj : j Cz Bg}, 
and the matrix is now normalised by group in each 
row, so X^tpb ^j,k — li then a multivariate version of 
DCA is created so that for each group g, 



w„ 



Multinomial L„ 



'mkfj,k 



J^Bg 



Fitting and modelling methods for this variation are re- 
lated to LDA or MPCA, and will not be considered in 
more detail here. This has the advantage that different 
kinds of words have their own multinomial and the dis- 
tribution of different kinds is ignored. This version is 
demonstrated subsequently on US Senate voting records, 
where each multinomial is now a single vote for a partic- 
ular senator. 



V. RELATED WORK 

These sections begins by relating the main approaches 
to each other, then placing them in the context of ex- 
ponential family models, and finally a brief history is 
recounted. 



Name 


Bagged 


Components 


p{x/w 1 0, 1) 


p{l/m) 


NMF fLS99] 


yes 


I 


Poisson 


NA 


PLSI fHof99] 


no 


m 


discrete 


NA 


LDA fBNJOS] 


no 


m 


discrete 


Dirichlet 


MPCA [Bun02l 


yes 


m 


multinomial 


Dirichlet 


GaP fCan04] 


yes 


I 


Poisson 


gamma 



A. Correspondences 

Various published cases of DCA can be represented in 
terms of this format, as given in Table ITU A multino- 
mial with total count L and J possible outcomes is the 
bagged version of L discrete distributions with J possible 
outcomes. In the table, NA indicates that this aspect of 
the model was not required to be specified because the 
methodology made no use of it. Note that NMF used 
a cost function formulation, and thus avoided defining 
likelihood models. It is shown later that its cost func- 
tion corresponds to a Gamma-Poisson with parameters 
a = f3 = (i.e., all zero). 

LDA has the multinomial of MPCA replaced by a se- 
quence of discrete distributions, and thus the choose term 
drops, as per Section Fill Al PLSI is related to LDA but 
lacks a prior distribution on m. It does not model these 
latent variables using full probabilit y theory , but instead 
using a weighted likelihood method HofQO] . Thus PLSI 
is a non-Bayesian version of LDA, although its weighted 
likelihood method means it accounts for over-fitting in a 
principled manner. 

LDA and MPCA also have a close relationship to GaP 
(called GP here). If the parameter a is treated as known 
and not estimated from the data, and the /3 parameter 
vector has the same value for each Pk, then L is apos- 
teriori independent of m and 0. In this context LDA, 
MPCA and GaP are equivalent models ignoring repre- 
sentational issues. 

Lemma 1. Given a Gamma-Poisson model of Sec- 
tion \IVA\ where the f3 parameter is a constant vector 
with all entries the same, (3, the model is equivalent 
to a Dirichlet-multinomial model of Section MV (A where 
iTT-k = Ik/ X]fe ^k, o,nd in addition 



Poisson- Ga 



Proof. Consider the Gamma-Poisson model. The sum L = 
^ . Wj of Poisson variables w has the distribution of a Poisson 
with parameter given by the sum of their means. When the 
sum of Poisson variables is known, the set of Poisson variables 
has a multin omial d istribution conditioned on the sum (the 
total count) |Ros89(| . The Poisson distributions on w then is 
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equivalent to: 



L ~ Poisson 2^, h 
\ k 



w ~ Multinomial ( L 



-&l 



Moreover, if the /3 parameter is constant, then mt = Ik/ X^fc ^k 
is distributed as Dirichletx (a), and "^j^lk is distributed in- 
dependently as a Gamma(^j. ak, 13). The second distribution 
above can then be represented as 

w ~ Multinomial (L, ©m) . 

Note also, that marginalising out X^fc^fc convolves a Poisson 
and a gamma distribution to produce a Poisson-Gamma dis- 
tribution for L ;BS94]. D 

If a is estimated from the data in GaP, then the pres- 
ence of the observed L will influence ct, and thus the other 
estimates such as of 0. In this case, LDA and MPCA will 
no longer be effectively equivalent to GaP. Note, Canny 
recomm ends fixing a. and estimating (3 from the data 
|Can04| . 

To complete the set of correspondences, note that in 
Section IVII Al it is proven that NMF corresponds to a 
maximum likelihood version of GaP, and thus it also 
corresponds to a maximum likelihood version of LDA, 
MPCA, and PLSI. 



tributed to the development of the DCA family of meth- 
ods. Some original research includes the following: grade 
of membership (GOM) WM82|, probabilistic latent se- 
mantic indexing (PLSI) Hof99|, non- negative matrix 
factorisation (NMF ) LS99] . genotype inference using 
admixtu res |PSDO0| . latent Dirichlet allocatio n (LDA ) 
|BNJ03 ^. and Gamma-Poisson models (GaP) |Can04| . 
Modifications and algorithm s have a lso been explored as 
multinomial PCA (MPCA) |Bun02l | and multiple aspect 
modelling jML02l | . Several of these models have been in- 
terpreted as instance s of a m ore general family of mixed- 
membership models |EFL04| . 

The first clear enunciation of the large-scale model in 
its Poisson form c omes f rom ILS9 9I. and in its multi- 
nomial form from 



| Hof99| and |PSDOq| . The first clear 
expression of the problem as a latent variable problem 
is given by PSDOOJ. The relationship between LDA and 
PLSI and that NMF was a Poisson version of LDA was 
first pointed out by Bun02j , and proven in [GG05] . The 
connections to IC A come from |BJ04] and Can04j . The 
general Gamma-Poisson formulation, p erhaps the final 
generalisation to this line of work, is in |Can04J . 

Related techniques in the statistical community can 
be traced back to Latent Class Analysis developed in the 
1950's, and a rich theory has since developed relating the 
methods to corresp ondence analysis and other statistical 
techniques 



I corresp c 
lvGv99| . 



B. Notes on the Exponential Family 



For the general DCA model of Section IIII Bl when 
p{w I QZ) is in t he so-called exponential family distribu- 
tions |GCSR95l |. the expected value of w is referred to 
as the dual parameter^ and it is usually the parameter 
we know best. For the Bernoulli with probability p, the 
dual parameter is p, for the Poisson with rate A, the dual 
parameter is A, and for the Gaussian with mean /x, the 
dual parameter is the mean. Our formulation, then, can 
be also be interpreted as letting w be exponential fam- 
ily with dual parameter given by (©O- Our formulation 
then ge neralises PCA in the same way that a linear model 
|MN89J| generalises linear regression. ^^_^^ 

Note, an alternative has also been presented |CDSOl| 
where w has an exponential family distribution with nat- 
ural parameters given by (0i). For the Bernoulli with 
probability p, the natural parameter is log(p/(l — p)), for 
the Poisson with rate A, the natural parameter is log A 
and for the Gaussian with mean /i, the natural parame- 
ter is the mean. This formulation generali ses PCA in the 
same way that a generalised linear model |MN89| gener- 
alises linear regression. 



C. Historical notes 

Several independent groups within the statistical com- 
puting and machine learning community have con- 



VI. COMPONENT ASSIGNMENTS FOR 
WORDS 

In standard mixture models, each document in a col- 
lection is assigned to one latent component. The DCA 
family of models can be interpreted as making each word 
in each document be assigned to one latent component. 
To see this, we introduce another latent vector which rep- 
resents the component assignments for different words. 
As in Section lillAI this can be done using a bag of com- 
ponents or a sequence of components representation, and 
no effective change occurs in the basic models, or in the 
algorithms so derived. What this does is expand out the 
term &l into parts, treating it as if it is the result of 
marginalising out some latent variable. 

We introduce a i^-dimensional discrete latent vector c 
whose total count is L, the same as the word count. The 
count Ck gives the number of words in the document ap- 
pearing in the fc-th component. Its posterior mean makes 
a good diagnostic and interpretable result. A document 
from the sports news might have 50 "football" words, 10 
"German" words, 20 "sports fan" words and 20 "general 
vocabulary" words. 

This latent vector is derived from a larger latent ma- 
trix, Y of size J X K and entries Vj^k- This has row to- 
tals Wj as given in the observed data and column totals 
Cfe. Vectors w and c are these word appearance counts 
and component appearance counts, respectively, based on 
summing rows and columns of matrix V. This is shown 



K components 



i 



' ^1,1 Wl,2 • ■ ■ Vx^K \ 
^2,1 W2,2 ■ • • V2,K 



,\vj,l Vj^2 



Vj.K / 



■W2 



Wj 



Cl C2 



Ck 



FIG. 4: A representation of a document as a contingency 
table. 



in Figure^ 

The introduction of the latent matrix V changes the 
forms of the hkclihoods, and makes the development and 
analysis of algorithms easier. This section catalogues 
the likelihood formula, to be used when discussing al- 
gorithms. The choices of different statist ical word count 
models have been recently evaluated by |ACF05l | in the 
context of document classification. 



A. The Gamma-Poisson Model 

With the new latent matrix V, the distributions un- 
derlying the Gamma-Poisson model become: 



Ik ^ Gamma(afc,/3fc) 
Cfe ~ Poisson(^fe) 



(5) 



Wi 



— y ^j,k, where Wj^fe ~ Multinomial (cfc, 0.^fc) 



The joint likelihood for a document, 
p{V,l\P,cy.,@,K,GP) (the w are now derived quan- 
tities so not represented), thus becomes, after some 
rearrangement 



n 



k j,k ■" 

Note that I can be marginalised out, yielding 



n 



n 



^3,k 



Tjck + ak) PT 

V{ak) {(3k + 1)=^+"" ^^l vj^kl 



(6) 



(7) 



and the posterior mean of Ik given c is {ck + ak) / {1 + f3k) ■ 
Thus each Ck '^ Poisson-Gamma(afc,/3fc, 1). 



B. The Conditional Gamma-Poisson Model 

The likelihood follows the GP case, except that with 
probability pk, h = and thus Ck = 0. The joint like- 
lihood, p{V,l\ f3,a, p,&, K,CGP), thus becomes, after 



some rearrangement 



ni(i-p.)i '''''''""yr"'"'"^' n'^ 




Note that I can be marginalised out, yielding 



r(afc) ipk + 1)^'=+"^ 



+ Pfclcfc=0 



n 



Q^'j.fc (8) 



Wj.fc! 



The Oj^k can be pulled out under the constraint ^ ■ Vj^k = 
Ck- The posterior mean of Ik given c is (1 — Pk){ck + 
ak)/{l + l3k)- 



C. The Dirichlet-Multinomial Model 

For the Dirichlet-multinomial model, a similar recon- 
struction applies: 

m ^ DirichletA'(Q!) (9) 

Ck ~ Multinomial (L, m) 

Wj = /^i'j,fc, where iij^fe ~ Multinomial (cfe, 0.,fc) . 

k 

The joint likelihood, p{V ,m\a.,@,K,T)M), thus be- 
comes, after some rearrangement 



xirK:^. n 



n 



"j-k 



\k / k r(afc) ;; v,.k'. 

Again, m can be marginalised out yielding 



L! 



n 



r(Cfc +ak) TT ^/,k 



r(i + Efcafe)V r(afc) ]^V3^k\ 



n 



(10) 



(11) 



VII. ALGORITHMS 

In developing an algorithm, the standard approach is 
to match an optimization algorithm to the functional 
form of the likelihood. When using Bayesian or some 
other statistical methodology, this basic approach is usu- 
ally a first step, or perhaps an inner loop for some more 
sophisticated computational statistics. 

The likelihoods do not yield easily to standard EM 
analysis. To see this, consider the forms of the likelihood 
for a single document for the GP model, and consider the 
probability for a latent variable z given the observed data 
w, p{z I w, (3, a, 0, K, GP). For EM analysis, one needs 
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to be able to compute V.z^p{z\w,&,...) [^ogp{w, z\@, . . .)]. 
There are three different forms of the HkeUhood seen so 
far depending on which latent variables z are kept on the 
left-hand side of the probability: 

p{wJ\f3,a,&,K,GP): from Equation has the 
term (X]fe ^k&j.k) ' , which means there is no known 
simple posterior distribution for I given w. 

p(i(;,Z, y 1/3, a,0,/ir, GP): from Equation 10 has the 
term ;^'=+"'=~ which links the two latent vari- 
ables I and V , and prevents a simple evaluation 
of Ej_v[t'j_fe] as required for the expected log prob- 
ability. 

p{w,V \l3,a,@,K,GT?): from Equation Q has the 
term V{ck + ak) (where Ck = ^jVj,k), which 
means there is no known simple posterior distri- 
bution for V given w. 

Now one could always produce an EM-like algorithm 
by separately updating I and V in turn according to 
some mean formula, but the guarantee of convergence 
of to a maximum posterior or likelihood value will 
not apply. In this spirit earlier authors point out that 
EM-like principles apply and use EM terminology since 
EM methods would apply if I was observed iSj . For the 
exponential family, which this problem is in, the vari- 
ational approximation algorithm with KuUback-Leibler 
diverg ence correspon ds to an extension of the EM algo- 
rithm |GBOCt lBun02J | . This variational approach is cov- 
ered below. 

Algorithms for this problem follow some general ap- 
proaches in the statistical computing community. Three 
basic approaches are presented here: a variational ap- 
proximation, Gibbs sampling, and Rao-Blackwellised 
Gibbs sampling. A maximum likelihood algorithm is not 
presented because it can be viewed as a simplification of 
the algorithms here. 



A. Variational Approximation with 
Kullback-Leibler Divergence 

This approximate method was first applied to the se- 
quential variant of the Dirichlet-multinomial version of 
the problem by BNJ03]. A fuller treatment of these 
va riational metho ds for the exponential family is given 
in |GB0lllun03. 

In this approach a factored posterior approximation is 
made for the latent variables: 

p{hV\w,^,a,®,K,GV) « q{l,V) = qi{l)qviV) 

and this approximation is used to find expectations as 
part of an optimization step. The EM algorithm results 
if an equality holds. The functional form of the approxi- 
mation can be der ived by inspection of the recursive func- 
tional forms (see |Bun02J | Equation Q): 

qi{l) ex exp{Ev^g^{v)[^ogp{l,V,w\®,a,f3,K)]) 

qv{V) K exp(E,^,,(,)[logp(Z,v,u;|0,a,Ai^)])(-12) 



An important computation used during convergence in 
this approach is a lower bound on the individual docu- 
ment log probabil ities. T his naturally falls out during 
computation (see |BunO^ Equation Q). Using the ap- 
proximation q(l,V) defined by the above proportions, 
the bound is given by 

\ogp{w\&,cx,(3,K)> 
^i.v^qii.v) [log p (l, V,w\&, a, (3, K)] 

+I{qi{l))+I{qv{V)). 

The variational approximation applies to the Gamma- 
Poisson version and the Dirichlet-multinomial version. 



1, For the Gamma-Poisson Model: 

Looking at the recursive functionals of Equation H12() 
and the likelihood of Equation ©, it follows that qiQ 
must be K independent Gammas one for each compo- 
nent, and qvO must be J independent multinomials, one 
for each word. The most general case for the approxima- 
tion qO is thus 

h ^ Gamma(afc,5fc) 
{vj^i...k} ~ Multinomial(wj, {rij^ : fc = 1, . . . , K}) , 

which uses approximation parameters (ak,bk) for each 
Gamma and and n.^k (normalised as X^fc^-j.fe ~ 1) ^^^r 
each multinomial. These parameters form two vectors 
a, b and a matrix TV respectively. The approximate pos- 
terior takes the form qi{l \ a, b)qv{V \ N). 

Using these approximating distributions, and again 
looking at the recursive functionals of Equation H12|) . one 
can extract the rewrite rules for the parameters: 



rijh 



ak 



1 

— ( 

ak 



;feexp(E[log?fc]) , (13) 



6fe = 1 + Pk , 
where E [log k] = ^i^^p{i^ | a^.h^) [log h] 

= *o(afe)-logfefe , (14) 

Zj = ^0j-fcexp(E[logZfe]) . 
fc 

Here, ^o() is the digamma function, defined as — '^^J^ 
and available in most scientific libraries. These equations 
form the first step of each major cycle, and are performed 
on each document. 

The second step is to re-estimate the model parameters 
using the posterior approximation by maximising the 
expectation of the log of the full posterior probability 

Ei,v~<;;(09v(V) [log p {I, V,w,&\ a, /3, K)] . 

This incorporates Equation ^ for each document, and 
a prior for each fc-th column of of Dirichlet,7(7) (the 
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last model item in Section llV A|) . Denote the intermedi- 
ate variables rijf^ for the i-th document by adding a (i) 
subscript, as nj^k,{i), and hkewise for Wj,(i). All these log 
probability formulas yield linear terms in 9j^k, thus with 
the normalising constraints for one gets 



Sj,k oc ^w^j,(i)nj,fe,(,) +7i 



(15) 



The lower bound on the log probability of Equation (jl^fl . 
after some simplification and use of the rewrites of Equa- 
tion p3|l . becomes 



1 1 V^i r(afc)&?''' .r-^ 



ak -afe)E[log Ik] 
^WjlogZj . 



(16) 

The variational approximation algorithm for the 
Gamma-Poisson version is summarised in Figure |S| An 
equivalent algorithm is produced if words are presented 
sequentially instead of being bagged. 

a. Complexity: Because Step 2(a) only uses words 
appearing in a document, the full Step 2 is 0{SK) in time 
complexity where S is the number of words in the full col- 
lection. Step 3 is 0{JK) in time complexity. Space com- 
plexity is 0{IK) to store the intermediate parameters a 
for each document, and the 0{2JK) to store © and its 
statistics. In implementation. Step 2 for each document 
is often quite slow, and thus both a and the document 
word data can be stored on disk and streamed, thus the 
main memory complexity is 0{2JK) since the 0{S) and 
0{IK) terms are on disk. If documents are very small 
(e.g., S/I <^ K, for instance "documents" are sentences 
or phrases), then this does not apply. 

h. Correspondence with NMF: A precursor to the 
GaP m odel is non-negative matrix factorisation (NMF) 
|LS99j, which is based on the matrix approximation 
paradigm using KuUback-Leibler divergence. The algo- 
rithm itself, converted to the notation used here, is as 
follows 



^fc, 



'A;,(i) 



E 



^3,k 



W 



'j,(i) 



J2i^j.k J2k^J,k^k,{i) 



V],k 



hk E 



ffe,(i) 



"jX^ 



J2ih,{i) J2k^3,kh 



,w 



Notice that the solution is indeterminate up to a factor 
ipk- Multiply lk,{i) by ipk and divide Bj^k by tpk and the 
solution still holds. Thus, without loss of generality, let 
6j^k be normalised on j, so that ^ 9j,k = 1- 

Lemma 2. The NMF equations above, where & is re- 
turned normalised, occur at a maxima w.r.t. and I 
for the Camma-Poisson likelihood Y[i p{w(^i) | 0, Z(i) , a = 
Q,(3^Q,K,GP). 



Proof. To see this, the following will be proven. Take a so- 
lution to the NMF equations, and divide 9j^k by a factor 
ipk = '}Z-j^i,k, and multiply lk,(i) by the same factor. This 
is equivalent to a solution for the following rewrite rules 



l'k,(i) 



h.{i) 2_^ dj,k 



"j,{i) 



Gj,k OC Sj,fc 2_^ ^k.(i) 



Y,k^i-kh,(i) 

Y,k^jMk,{i) 



where Oj^k is kept normalised on j. These equations hold 
at a maxima to the likelihood Y\^p{w(^i•J \ @,l(^i-j,a. — 0,(3 = 
0,K, GP). The left equation corresponds to a maxima w.r.t. 
Z(i) (note the Hessian for this is easily shown to be negative in- 
definite), and the right is the EM equations for the likelihood, 
w.r.t. ©. 

To show equivalence of the above and the NMF equations, 
first prove the forward direction. Take the scaled solution to 
NMF. The NMF equation for lk,{i) is equivalent to the equa- 
tion for lk,{i) in the lemma. Take the NMF equation for dj^k 
and separately normalise both sides. The X^i 'fc,(0 term drops 
out and one is left with the equation for 6j^k in the lemma. 
Now prove the backward direction. It is sufficient to show 
that the NMF equations hold for the solution to the rewrite 
rules in the lemma, since 6j^k is already normalised. The NMF 
equation for lk,(i) clearly holds. Assuming the rewrite rules 
in the lemma hold, then 



Cj.fc — 



dj.k Y,i {lk,(i)'Wj,(i) /Y^k ^3,klk,(i) ) 
Ej 9j,k J2^ {lk,{i)Wj,(i) /Y.k ^jMk,(i) ) 

^j,k J2i {lk,(i)™J,(i) /J2k ^j,kh,(i) ) 

J2i h,{i) Yjj (^j,'='^j,(i) /Efc Gj,kik,(i) ) 

^J,k Ei {h,(i)Wj,(i) /J2k ^j,kh,(i) ^ 



(reorder sum) 



Thus the second equation for NMF holds. 



(apply rewrite) 



D 



Note, including a latent variable such as I in the like- 
lihood (and not dealing with it using EM methods) does 
not achieve a correct maximum likelihood solution for the 
expression Y\^ p{w(^i'^ \S,a = 0,(3 = 0,K, GP). In prac- 
tice, this is a common approximate method for handling 
latent variable problems, and can lead more readily to 
over-fitting. 



2. For the Dirichlet- Multinomial Model: 

The variational approximation takes a related form. 
The approximate posterior is given by: 

m ~ Dirichlet(a) 
{vj,i...k} ~ Multinomial (ifj , {nj,k : k — 1, . . . , K}) 
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1. Initialise a for each document. The uniform initiaUsation would be ak = (X^fc ^k + i) / K. Note N is not stored. 

2. Do for each document: 

(a) Using Equations l|18|l . recompute N and update a in place. 

(b) Concurrently, compute the log-probability bound of Equation l|16|l . and add to a running total. 

(c) Concurrently, maintain the sufficient statistics for 0, the total ^^ ^j,{i)^j,k,(i) for each j, k over documents. 

(d) Store a for the next cycle and discard N . 

3. Update using Equation l|15|l . normalising appropriately. 

4. Report the total log-probability bound, and repeat, starting at Step 2. 



FIG. 5: K-L Variational Algorithm for Gamma-Poisson 



This yields the same style update equations as Equa- 
tions l(T^ except that /3fc = 1 



riik 



1 



6*^,^ exp (E [log mfe]) , (17) 



Ofc = ak + 22 



Wjnj^k , 



where E [log m*.] = E,„^^p(,„^ |„) [log toj.] 



Zj = ^ej,fcexp(E[logmfe]) . 



Equation (|15|l is also the same. The lower 

bound on the individual document log probabilities, 
logp {w\&,a, K, DM) now takes the form 



log (C^)- log 



r(Efcafc)nfcrK) 
r(Efc«fe)nferK) 

+ ^(afc - afc)E [log nik] + ^ Wj log Zj . 



(19) 



The correspondence with Equation (|16|l is readily seen. 
The algorithm for Dirichlet-multinomial version is re- 
lated to that in Figure [3 Equations H17(l replace Equa- 
tions (|13|l , Equation H19|) replaces Equation H16() , and the 
initialisation for a^ should be 0.5, a Jeffreys prior. 



B. Direct Gibbs Sampling 

There are two styles of Gibbs sampling that apply to 
DCA. The first is a basic Gibbs sampli ng first p roposed 
by Pritchard, Stephens and Donnelly |PSDO0f . Gibbs 
sampling is a conceptually simple method. Each unob- 
served variable in the problem is resampled in turn ac- 
cording to its conditional distribution. We compute its 
posterior distribution conditioned on all other variables, 
and then sample a new value for the variable using the 
posterior. For instance, an ordering we might use in this 



the low level sampling in this section use well known dis- 
tributions such as gamma or multinomial, and are avail- 
able in standard scientific libraries. 

To develop this approach for the Gamma-Poisson, look 
at the full posterior, which is a product of individual 
document likelihoods with the prior for from the last 
model item in Section IIV Al The constant terms have 
been dropped. 



n n 



Pf-l^i^ exp{-(/3, + l)?fc,(.)} 9\ 



r(afc) 



j,k 



Vj.k.{i) 



problem is: In), V 



(1), V(i), 



^(2),V(2), 



' hl)'^{I)^ 



&. All 



(20) 

Each of the conditional distributions used in the Gibbs 
sampling are proportional to this. The first conditional 
distribution is p{l(^i) \'V(^i),j3,a,&,K,GP). From this, 
isolating the terms just in i(i), we see that each lk,{i) i^ 
conditionally gamma distributed. Likewise, each v.^^i) 
is multinomial distributed given Lj) and 0, and each 9.k 
is Dirichlet distributed given all the Z(j) and V{i) for each 
i. The other models are similar. An additional effort is 
required to arrange the parameters and sequencing for 
efficient use of memory. 

The major differentiator for Gibbs sampling is the re- 
sampling of the latent component vector I. The sam- 
pling schemes used for each version are given in Ta- 
ble mil Some care is required with the conditional 
Gamma-Poisson. When Ck = 0, the sampling for Ik needs 
to decide whether to use the zero case or the non-zero 
case. This uses Equation (|SJ| to make the decision, and 
then resorts to Equation © if it is non-zero. 

The direct Gibbs algorithm for the general case is given 
in Figure This Gibbs scheme turns out to correspond 
to the variational approximation, excepting that sam- 
pling is done instead of maximisation or expectation. 

The log probability of the words w can also ac- 
cumulated in step 1(c). While they are in terms 
of the latent variables, they still represent a rea- 
sonably unbiased estimate of the likelihoods such as 
P (wi^i) , . . . , 10(7) \cx,f3,&,K, GP) . 
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1. For each document i, retrieve the last C(i) from store, then 

(a) Sample the latent component variables Z(i) (or its normalised counterpart m^i^) as per 
Table irm 

(b) For each word j in the document with positive count w-,,(i), the component counts vector, 
from Equation JSJ and Equation JUJ, 

{vj,k,{i) '■ k — 1, . . . , K} ~ Multinomial I 1^^,(1), < ^^ '. '- — : k 



Y,k^k,{i)Sj.k 

Alternatively, if the sequence-of-components version is to be used, the component for each 
word can be sampled in turn using the corresponding Bernoulli distribution. 

(c) Concurrently, accumulate the log-probability p (loji) | Zj^), a, /3, ©, if, GP) , 
p{i"{i) l'(i),a,/3,P, 0,-^", CGP), or p(iy(i) | m(i), L(,),a,/3, 0,iC, DM). 

(d) Concurrently, maintain the sufficient statistics for 0, the total ^^ ^j.fc.CO for each j, k over 
documents. 

(e) Store cj^j for the next cycle and discard V{i)- 

2. Using a Dirichlet prior for rows of 0, and having accumulated all the counts V(i) for each 
document in sufficient statistics for 0, then its posterior has rows that are Dirichlet. Sample. 

3. Report the total log-probability, and report. 



FIG. 6: One Major Cycle of Gibbs Algorithm for DCA 



Model 


Sampling 


GP 


h ~ Gamma(cfe +ak,l + Pk)- 


CGP 


If Cfe = 0, then Conditional Gamma-Poisson with rate 
(i-..a^:^:a:..)-^ and Gamma(a„ 1 + A). If 
Cfe / 0, revert to the above Gamma-Poisson case. 


DM 


m ~ Dirichlet({cfe -I- aj, : k}). 



TABLE III: Sampling components for direct Gibbs on a single 
document 



C. Rao-Blackwellised Gibbs Sampling 



Rao-Blackwellisation of Gibbs sampling |CR96l | com- 
bines closed form updates of variables with Gibbs sam- 
pling. It does so by a process called marginalisation or 
variable elimination. When feasible, it can lead to signif- 
icant improve ments , the general case for DCA. Griffiths 
and Steyvers |GS04^ introduced this algorithm for LDA, 
and it easily extends to the Gamma-Poisson model and 
its conditional variant with little change to the sampling 
routines. 

When using this approach, the first step is to consider 
the full posterior probability and see which variables can 
be marginalised out without introducing computational 
complexity in the sampling. For the GP model, look 
at the posterior given in Equation if^ . Equations 
shows that the Z(i)'s can be marginalised out. Likewise, 
can be marginalised out because it is an instance 
of a Dirichlet. This yields a Gamma-Poisson posterior 
p ( V(i) , . . . , V{i) \a,j3,K, GP) , with constants dropped: 



\ -l-l- (1 4- R,Yk,(i)+°'k 11 



, (l + /3.r'"<''+"'=— «,,fc,w! 



n 






(21) 



Below it is shown that a short sampling routine can be 
based on this. 

A similar formula applies in the conditional GP case 
using Equation (jS]) for the marginalisation of Z(i)'s. The 
first term with Y[k i'^ Equation 121|l becomes 

^ ^'^ r(afc) (i + /?fe)-^.w+"^ +^^-'=-.(-)=o- 

Likewise a similar formula applies in the Dirichlet- 
multinomial version using Equation Hll|l : 



n nr(c.. 



cM 






n 



Uj r (7j + Ei Vj,k. 



(22) 



(i)) 



r (Ej 7j + Ei Cfe,(») 



Here a term of the form T (Efe('^fe,(2) ^' '^k)) drops out 
because Efc Cfc,(j) — -^(j) is known and thus constant. 

Now the posterior distributions have been marginalised 
for each of the three models, GP, CGP and DM, a 
Gibbs sampling scheme needs to be developed. Each set 



{. 



j-.k.ii) 



k e l,...,K} sums to Wj(i\^ moreover the 
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Model 


Sampling Proportionality 


GP 


E,7.+E,Cfc 1 + A 


CGP 


When Cfc > use the proportionality of the GP case, 
and otherwise 


E, 7. + E. Cfc 1 + /3fe (1 - pfc)/3r +Pk{l + /3fe)°'= 


DM 


7j + Ei Vj,k , , ^ 
E,7j +E>Cfc 



TABLE IV: Sampling ki — k given j = j; for Rao- 
Blackwellised Gibbs 



Implementation notes: 

Due to Rao-Blackwellisation, both the i(i)'s and are 
effectively re-estimated with each sampling step, instead 
of once after the full pass over documents. This is most 
effective during early stages, and explains the superiority 
of the method observed in practice. Moreover, it means 
only one storage slot for is needed (to store the suf- 
ficient statistics), whereas in direct Gibbs two slots are 
needed (current value plus the sufficient statistics). This 
represents a major saving in memory. Finally, the lu^'s 
and can be sampled at any stage of this process (be- 
cause their sufficient statistics make up the totals appear- 
ing in the formula), thus Gibbs estimates for them can 
be made as well during the MCMC process. 



forms of the functions in Equations (|21|l and (|22|l are 
quite nasty. A way out of this mess is to convert the 
scheme from a bag of words model, implicit in the use of 
V ti) and iy(i), to a sequence of words model. 

This proceeds as follows. Run along the Lu\ words in a 
document and update the corresponding component as- 
signment for each word. Gomponent assignments for the 
j-th document are in a L(j) -dimensional vector fej,) , where 
each entry takes a value from 1, . . . ,K. Suppose the l-th 
word has word index ji. In one step, change the counts 
{^ii,/c,(j) • k £ 1,...,K} by one (one is increased and 
one is decreased) keeping the total Wj^u\ constant. For 
instance, if a word is originally in component fci but up- 
dating by Gibbs sampling to ^2, then decrease I'j, ,fci,(i) by 
one and increase Wj, ,/c2,(i) by one. Do this for Lu\ words 
in the document, for each document. Thus at word I 
for the j-th document, we sample component assignment 
fc; (i) according to the posterior for fc; (^^ with all other 
assignments fixed. This posterior is proportional to (the 
denominator is a convenient constant) 



p{V I sequential, a, /3,iir, GP)|, 



'ji,k,(i)*-Vji,k,(i)+ik^k, 



p{V \ sequential, a,f3,K, GP) |^ 



'j,,fc,,(>)^i'j,,fc,,(>)- 



where the notation "sequential" is added to the right- 



hand side because the combinatoric terms v 



'j,k,(i) 



! of 



Equation (|21|l need to be dropped. This formula sim- 
plifies dramatically because T{x + l)/r(a::) = x. 

Derived sampling schemes are given in Table Hvl The 
(j) subscript is dropped and assumed for all counts, and 
j — ji is the word index for the word whose component 
index is being resampled. Since ki is being sampled, a 
K dimensional probability vector is needed. The table 
gives the unnormalised form. 

This Rao-Blackwellised Gibbs algorithm is given in 
Figure[7| As before, an approximately unbiased log prob- 
ability can be recorded in Step 2(c). This requires a value 
for 0. While the sufficient statistics could be used to sup- 
ply the current mean estimate for 0, this is not a true 
sampled quantity. An alternative method is to make a 
sample of in each major cycle and use this. 



D. Historical notes 

Some previous algorithms can now be placed into con- 
text. 



NMF: Straight maximum hkclihood, e.g. in |LS99J| . 
expressed in terms of KuUback-Leibler divergence 
minimization, where optimisation jointly applies to 
the latent variables (see Section IVII A lb|l . 



PLSI: Annealed maximum likelihood |Hof99l | . best 
vie wed in terms of its clustering precursor such as 

byima. 

Various Gibbs: Gibbs sampling on V(i), i(i)/m(i) and 
i n turn using a full probability distribution 
by |PSDO0l |. or Gibbs samphng on V(,;) alone 
(or equivalently, component assignments for words 
in the sequence of words representation) after 
marginalising out Z(,j)/m(i) and by |GS04| . 

LDA: variational a pproxim ation with KuUback-Leibler 
divergence by |BNJ03| |. a significant introduction 
because of its speed. 



Expectation propagation |ML02J | requires 0{KS) latent 
variables stored, a prohibitive expense compared to the 
0(5*) or 0{KI) of other algorithms. Thus it has not been 
covered here. 



E. Other Aspects for Estimation and Use 

A number of other algorithms are needed to put these 
models into regular use. 

a. Component parameters: The treatment so far 
has assumed the parameter vectors a and j3 are given. 
It is more usual to estimate these par ameters with th e 
rest of the estimation tasks as done by |BNJ03l ICan04| . 
This is feasible because the parameters are shared across 
all the data, unlike the component vectors themselves. 
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1. Maintain the sufficient statistics for 0, given by '}2i'"3,k.(') ^'^'" ea^ch j and k, and the sufficient 
statistics for the component proportions /(jj/mjj) given by Cf^j. 

2. For each document i, retrieve the L(i) component assignments for each word then: 

(a) Recompute statistics for l(i^/m(^i-) given by Ck^{i) = Yli ^j,fc,(i) ^o'' each k from the individual 
component assignment for each word. 

(b) For each word I with word index ji and component assignment ki in the document, resample 
the component assignment for this word according to the marginalised likelihoods in this 
section. 

i. First decrement Uj;,fc,,(i) and Ci;,,{i) by one to remove the component assignment for 

the word, 
ii. Sample ki — k proportionally as in Table HVI 
iii. Increment «j,,fe,,(i) and Cs.,,(i). 

(c) Concurrently, record the log-probability such as p (t«(i) | V (i) , ct, f3, ©, K, GP) for the ap- 
propriate model. 

(d) Concurrently, update the sufficient statistics for l^if/m^i-) and 0. 



FIG. 7: One Major Cycle of Rao-Blackwellised Gibbs Algorithm for DCA 



b. Estimating the number of components K: The 
number of components K is usually a constant assumed 
a priori. But it may be helpful to treat as a fittable pa- 
rameter or a random variable that adapts to the data. 
In popular terms, this could be used to find the "right" 
number of components, though in practice and theory 
such a thing might not exist. To obtain best-fitting K , 
we can employ cross-validation, or we assess the evidence 
(or marginal likelihood) for the model given a particular 
choice of K CC95, BJ04]. In particular, evidence is the 
posterior probability of the data given the choice of K 
after all other parameters have been integrated out. 



c. Use on new data: A typical use of the model re- 
quires performing inference related to a particular doc- 
ument. Suppose, for instance, one wished to estimate 
how well a snippet of text, a query, matches a document. 
Our document's components are summarised by the la- 
tent variables m (or I). If the new query is represented 
by q, then p{q\m,S,K,GP) is the matching quantity 
one would like ideally. Since m is unknown, we must 
average over it. Various methods have been proposed 
[ML02,.„ B,in4j . 



d. Alternative compo nents: Hierarchical compo- 
nents have been suggested 'BJ04J as a way of organising 
an otherwise large flat component space. For instance, 
the Wikipedia with over half a million documents 
can easily support the discovery of several hundred 
components. Dirichlet processes have been developed 
as an alternative to the ii'-dimensional component 

i )riors i n the Dirichlet-multinomial/discrete model 
YYTOSJ I , although in implementation the effect is to use 
Jf-dimensional Dirichlets for a large K and delete low 
performing components. 



VIII. APPLICATIONS 

This section briefly discusses two applications of the 
methods. 



A. Voting Data 

One type of political science data are the roll calls. 
There were 459 roll calls in the US Senate in the year 
2003. For each of those, the vote of every senator was 
recorded in three ways: 'Yea', 'Nay' and 'Not Voting'. 
The outcome of the roll call can be positive (e.g.. Bill 
Passed, Nomination Conflrmed) corresponding to 'Yea', 
or negative (e.g.. Resolution Rejected, Veto Sustained). 
Hence, the outcome of the vote can be interpreted as 
the 101st senator, by associating positive outcomes with 
'Yea' and negative outcomes with 'Nay'. 



1. Application of the Method: 

We can now map the roll call data to the DCA frame- 
work. For each senator X we form two 'words', where 
wx.y implies that X voted 'Yea', and wx^n implies that 
X voted 'Nay'. Each roll call can be interpreted as a 
document containing a single occurrence of some of the 
available words. The pair of words wx,y,wx,n is then 
treated as a binomial, so the multivariate formulation of 
Section llV Dl is used. Priors for were Jeffreys priors, 
a was (0.1,0.1,. ..,0.1), and regular Gibbs sampling was 
used. 

Special-purpose models are normally used for inter- 
preting roll call data in political science, and they of- 
ten postulate a model of rational decision making. Each 
senator is modelled as a position or an ideal point in 
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a continuous spatial model of preferences |C JR04 '| . For 
example, the first dimension often delineates the liberal- 
conservative preference, and the second region or social 
issues preference. The proximities between ideal points 
'explain' the positive correlations between the senators' 
votes. The ideal points for each senator can be ob- 
tained either by optimization, for instance, with the opti- 
mal classifi cation a lgorithm PooOOj , or through Bayesian 
modeUing |C,TR04l . 

Unlike the spatial models, the DCA interprets the cor- 
relations between votes through membership of the sen- 
ators in similar blocs. Blocs correspond to latent com- 
ponent variables. Of course, we can speak only of the 
probability that a particular senator is a member of a 
particular bloc. The corresponding probability vector is 
normalized and thus assures that a senator is always a 
member of one bloc on the average. The outcome of the 
vote is also a member of several blocs, and we can inter- 
pret the membership as a measure of how influential a 
particular bloc is. 

Our latent senator (bloc) can be seen as casting votes 
in each roll call. We model the behavior of such latent 
blocs across the roll calls, and record it: it has a behavior 
of its own. In turn, we also model the membership of 
each senator to a particular bloc, which is assumed to be 
constant across all the blocs. 

A related family of approaches is based on modelling 
relations or networks using blocks or groups. There, a 
roll call would be described by one network, individual 
senators would be nodes in that network, and a pair of 
nodes is connected if the two senators agreed. Discrete 
latent variables try to explain the existence of links be- 
twee n entitie s in ter ms of senators' membership to blocks, 
e.g., |HLL8.1ISN9'^ . 

Several authors prefer the block-model approach to 
modelling roll call data WMM05]. The membership of 
senators to the same block along with a high probability 
for within-block agreements will explain the agreements 
between senators. While a bloc can be seen as having 
an opinion about each issue, a block does not (at least 
not explicitly). The authors also extended this model to 
'topics', where the membership of senator to a particu- 
lar block depends on the topic of the issue; namely, the 
agreement between senators depends on what is being 
discussed. The topic is also associated with the words 
that appear in the description of an issue. 



2. Visualization: 

We can analyze two aspects of the DCA model as ap- 
plied to the roll call data: we can examine the mem- 
bership of senators in blocs, and we can examine the 
actions of blocs for individual issues. The approach to 
visualization is very similar, as we are visualizing a set of 
probability vectors. We can use the gray scale to mirror 
the probabilities ranging from (white) to 1 (black). 

As yet, we have not mentioned the choice oi K - the 



number of blocs. Although the number of blocs can be 
a nuisance variable, such a model is distinctly more dif- 
ficult to show than one for a fixed K. We obtain the fol- 
lowing negative logarithms to the base 2 of the model's 
likelihood for K = 4,5,6,7,10: 9448.6406, 9245.8770, 
9283.1475, 9277.0723, 9346.6973. We see that X = 5 
is overwhelmingly selected over all others, with K — 4, 
being far worse. This means that with our model, we 
best describe the roll call votes with the existence of five 
blocs. Fewer blocs do not capture the nuances as well, 
while more blocs would not yield reliable probability esti- 
mates given such an amount of data. Still, those models 
are also valid to some extent. It is just that for a single 
visualization we pick the best individual one of them. 

We will now illustrate the membership of senators in 
blocs. Each senator is represented with a vertical bar of 5 
squares that indicate his or her membership in blocs. We 
have arranged the senators from left to right using the 
binary PCA approach of [de 03j . This ordering attempts 
to sort senators from the most extreme to the most mod- 
erate and to the most extreme again. Figure [S] shows the 
Democrat senators and Figure El the Republicans. 

We can observe that component 5 is the Democrat ma- 
jority. It is the strongest overall component, yet quite 
uninfluential about the outcome. Component 4 are the 
moderate Democrats, and they seem distinctly more in- 
fluential than the Democrats of the majority. Component 
3 is a small group of Republican moderates. Component 
2 is the Republican majority, the most influential bloc. 
Component 1 is the Republican minority, not very influ- 
ential. Component 1 tends to be slightly more extreme 
than component 2 on the average, but the two compo- 
nents clearly cannot be unambiguously sorted. 



IX. CLASSIFICATION EXPERIMENTS 

DCA is not trying to capture those aspects of the text 
that are relevant for distinguishing one class of docu- 
ments from another one. Assume our classification task 
is to distinguish newspaper articles on politics from all 
others. For DCA, modelling the distribution of a word 
such as 'the' is equally or more important than mod- 
elling the distribution of a highly pertinent word such as 
'tsunami' in classifying news reports. Of course, this is 
only a single example of sub-optimality. Nevertheless, if 
there are several methods of reducing the dimensional- 
ity of text that all disregard the classification problem at 
hand, we can still compare them with respect to classifi- 
cation performance. 

We used MPCA and tested its use in its role as a fea- 
ture construction tool, a common use for PCA and ICA, 
and as a classification tool. For this, we used the 20 
newsgroups collection described previously as well as the 
Reut ers-215 78 collection 0. We employed the SVM'*»''* 
V5.0 |Joa99| classifier with default settings. For classifi- 
cation, we added the class as a distinct multinomial (cf. 
Section FlV D|l for the training data and left it empty for 
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FIG. 8: Component membership for Democrats 
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FIG. 9: Component membership for Republicans 



the test data, and then predicted the cl ass val ue. Note 
that for performance and accuracy, SVM |Joa9 8] is a clear 
winner |LYRL04J . It is interesting to see how MPCA 
compares. 

Each component can be seen as generating a number 
of words in each document. This number of component- 
generated words plays the same role in classification as 
does the number of lexemes in the document in ordinary 
classification. In both cases, we employed the tf *idf 
transformed word and component-generated word counts 
as feature values. Since SVM works with sparse data 
matrices, we assumed that a component is not present 
in a document if the number of words that a compo- 
nent would have generated is less than 0.01. The com- 
ponents alone do not yield a classification performance 
that would be competitive with SVM, as the label has no 
distinguished role in the fitting. However, we may add 
these component- words in the default bag of words, hop- 
ing that the conjunctions of words inherent to each com- 
ponent will help improve the classification performance. 

For the Reuters collection, we used the ModApte split. 
For each of the 6 most frequent categories, we performed 
binary classification. Further results are disclosed in Ta- 
ble 2 Q. No major change was observed by adding 
50 components to the original set of words. By per- 
forming classification on components alone, the results 



were inferior, even with a large number of components. 
In fact, with 300 components, the results were worse 
than with 200 components, probably because of over- 
fitting. Therefore, regardless of the number of compo- 
nents, the SVM performance with words cannot be re- 
produced by component-generated words in this collec- 
tion. Classifying newsgroup articles into 20 categories 
proved more successful. We employed two replications 
of 5-fold cross validation, and we achieved the classi- 
fication accuracy of 90.7% with 50 additional MPCA 
components, and 87.1% with SVM alone. Comparing 
the two confusion matrices, the most frequent mistakes 
caused by SVM+MPCA beyond those of SVM alone were 
predicting talk. politics. misc as sci. crypt (26 errors) and 
talk. religion. misc predicted as sci. electron (25 errors). 
On the other hand, the components helped better iden- 
tify alt. atheism and talk. politics. misc, which were mis- 
classified as talk. religion. misc (259 fewer errors) earlier. 
Also, talk. politics. misc and talk. religion. misc were not 
misclassified as talk. politics. gun (98 fewer errors). These 
50 components were not very successful alone, resulting 
in 18.5% classification accuracy. By increasing the num- 
ber of components to 100 and 300, the classification accu- 
racy gradually increases to 25.0% and 34.3%. Therefore, 
many components are needed for general-purpose classi- 
fication. 
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TABLE V: SVM Classification Results 



SVM 


SVM+MPCA 


CAT ACC. P/R 


ACC. P/R 


earn 98.58 98.5/97.1 


98.45 98.2/97.1 


acq 95.54 97.2/81.9 


95.60 97.2/82.2 


moneyfx 96.79 79.2/55.3 


96.73 77.5/55.9 


grain 98.94 94.5/81.2 


98.70 95.7/74.5 


crude 97.91 89.0/72.5 


97.82 88.7/70.9 


trade 98.24 79.2/68.1 


98.36 81.0/69.8 




MPCA (50 comp.) 


MPCA (200 comp.) 


CAT ACC. P/R 


ACC. P/R 


earn 96.94 96.1/94.6 


97.06 96.3/94.8 


acq 92.63 93.6/71.1 


92.33 95.3/68.2 


moneyfx 95.48 67.0/33.0 


96.61 76.0/54.7 


grain 96.21 67.1/31.5 


97.18 77.5/53.0 


crude 96.57 81.1/52.4 


96.79 86.1/52.4 


trade 97.82 81.4/49.1 


97.91 78.3/56.0 



NMF, PLSI, LDA, MPCA and GaP. For instance, NMF 
with normalised results corresponds to an approximate 
maximum likelihood method for LDA, and GaP is the 
most general family of models. We have also presented 
the different algorithms available for three different cases, 
Gamma-Poisson, conditional Gamma-Poisson (allowing 
sparse component scores), and Dirichlet-multinomial. 
This extends a number of algorithms previous devel- 
oped for MPCA and LDA to the general Gamma-Poisson 
model. Experiments with the MPCA softwarejOj show 
that a typical 3GHz desktop machine can build models 
in a few days with K in the hundreds for 3 gigabytes of 
text. 

These models share many similarities with both PCA 
and ICA, and are thus useful in a range of feature en- 
gineering tasks in machine learning and pattern recog- 
nition. A rich literature is also emerging extending the 
model in a variety of directions. This is as much caused 
by the surprising performance of the algorithms, as it is 
by the availability of general Gibbs sampling algorithms 
that allow sophisticated modelling. 



From these experiments, we can conclude that com- 
ponents may help with tightly coupled categories that 
require conjunctions of words (20 newsgroups), but not 
with the keyword-identifiable categories (Reuters). Judg- 
ing from the ideas in ^JB03] , the components help in two 
cases: a) when the co-appearance of two words is more 
informative than the sum of informativeness of individ- 
ual appearances of either word, and b) when the appear- 
ance of one word implies the appearance of another word, 
which does not always appear in the document. 



X. CONCLUSION 

In this article, we have presented a unifying frame- 
work for various approaches to discrete component anal- 
ysis, presenting them as a model closely related to ICA 
but suited for sparse discrete data. We have shown the 
relationships between existing approaches here such as 
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[1] This is a distribution on integers where a rate is given 
for events to occur, and the distribution is over the total 
number of events counted. 

[2] By a change of coordinates 
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[3] The discrete distribution is the multivariate form of a 
Bernoulli where an index j £ {0, 1, . . . , J — 1} is sampled 
according to a J-dimensional probability vector. 

[4] Some advanced fitting methods such as Gibbs sampling 
do not treat the likelihood as a black-box. They introduce 
latent variables that expands the functional form of the 



likelihood, and they may update parts of a document in 
turn. For these, ordering effects can be incurred by bagging 
a document, since updates for different parts of the data 
will now be done in a different order. But the combinatoric 
term mapping bag to sequence representations will still be 
ignored and the algorithms are effectively the same up to 
the ordering affects. 

[5] Conventions for the gamma vary. Sometimes a parameter 
l//3fc is used. Our convention is revealed in Equation J^J. 

[6] The likelihood p{%u,V \l, P,ol,@, K,GP) can be treated 
with EM methods using the latent variable V and leaving 
I as if it was observed. 

[7] The Reuters-21578, Distribution 1.0 test collection is avail- 
able fr om David D. Lewis' professional hom e page, cur- 
rently: |http://www.research.att.com/~lewis| 

[8] The numbers are percentages, and 'P/R' indicates preci- 
sion/recah. 

[9] |http://www.componentan alysis.org] 



